DATA 612 Graduate Project

This is the R code used to import the data into R.

Seattle <- read_csv("2016-building-energy-benchmarking.csv")
view(Seattle)

The names of the columns are quite self-explanatory; however, we will further define them.

OSEBuildingID: ID of building BuildingType: Type and purpose of building PrimaryPropertyType: Main use or function building PropertyName: Name of the building on property Address: Address where building is located City: City where building is located State: Statew here building is located ZipCode: Zipe code of building CouncilDistrictCode: Council district where building is located Neighborhood: Colloquial name of city area where building is located Latitude: Latitudinal coordinate of building Longitude: Longitudinal coordinate of building YearBuilt: Year in which the building was built NumberofBuildings: Number of separate buildings which constitute the entire unit NumberofFloors: Number of floors which a building contains PropertyGFATotal: (Gross Floor Area) of Parking PropertyGFABuilding(s): (Gross Floor Area) of the building that is non-parking ListOfAllPropertyUseTypes: Lists all the uses of of the building. This is a non-orthodox way of recording data as there are multiple data in one cell. For the purposes of this project, we do not need you wrangle and tidy this column.

Here, we will set Seattle as a data from and as a tibble.

Seattle_t <- as_tibble(Seattle)
Seattle_df <- as.data.frame(Seattle)

Let’s identify and characterize the Seattle table. Here, the head() function will state which columns are doubles and which are characters. For these data, all the character columns are categorical variables and all the double columns are quantitative except for OSEBuildingID. Even though OSEBuildingID is a double, it is a categorical variable in reality since taking an average of the values or even regressing OSEBuildingID to some other variable would be arbitrary and non-sensical.

#glimpse(Seattle)
head(Seattle)

## # A tibble: 6 × 19
##   OSEBuildingID BuildingType  PrimaryPropertyT… PropertyName Address City  State
##           <dbl> <chr>         <chr>             <chr>        <chr>   <chr> <chr>
## 1             1 NonResidenti… Hotel             Mayflower p… 405 Ol… Seat… WA   
## 2             2 NonResidenti… Hotel             Paramount H… 724 Pi… Seat… WA   
## 3             3 NonResidenti… Hotel             5673-The We… 1900 5… Seat… WA   
## 4             5 NonResidenti… Hotel             HOTEL MAX    620 ST… Seat… WA   
## 5             8 NonResidenti… Hotel             WARWICK SEA… 401 LE… Seat… WA   
## 6             9 Nonresidenti… Other             West Precin… 810 Vi… Seat… WA   
## # … with 12 more variables: ZipCode <dbl>, CouncilDistrictCode <dbl>,
## #   Neighborhood <chr>, Latitude <dbl>, Longitude <dbl>, YearBuilt <dbl>,
## #   NumberofBuildings <dbl>, NumberofFloors <dbl>, PropertyGFATotal <dbl>,
## #   PropertyGFAParking <dbl>, PropertyGFABuilding(s) <dbl>,
## #   ListOfAllPropertyUseTypes <chr>

We will now list the unique observations in the ListOfAllPropertyUseTypes column.

Seattle %>%
  count(ListOfAllPropertyUseTypes) %>%
  filter(n>1)

## # A tibble: 153 × 2
##    ListOfAllPropertyUseTypes                           n
##    <chr>                                           <int>
##  1 Automobile Dealership                               2
##  2 Automobile Dealership, Parking                      4
##  3 Bank Branch                                         2
##  4 Bank Branch, Office                                 3
##  5 Bank Branch, Office, Parking                        2
##  6 Bank Branch, Parking                                2
##  7 College/University                                 21
##  8 College/University, Parking                         2
##  9 Data Center, Distribution Center, Office, Other     2
## 10 Data Center, Office                                 7
## # … with 143 more rows

For the sake of this project, we will filter all the Hotel values.

Seattle%>%
  filter(ListOfAllPropertyUseTypes %in% c("Hotel", 
                                      "Hotel, Multifamily Housing, Parking, Retail Store",
                                      "Hotel, Parking",
                                      "Hotel, Parking, Restaurant",
                                      "Hotel, Parking, Swimming Pool",
                                      "Hotel, Restaurant",
                                      "Hotel, Retail Store",
                                      "Hotel, Swimming Pool"))-> Seattle_hotel

Seattle_hotel %>%
  unite(Coordinates, Latitude, Longitude, sep = ",") -> Seattle_hotel

head(Seattle_hotel)

## # A tibble: 6 × 18
##   OSEBuildingID BuildingType  PrimaryProperty… PropertyName  Address City  State
##           <dbl> <chr>         <chr>            <chr>         <chr>   <chr> <chr>
## 1             1 NonResidenti… Hotel            Mayflower pa… 405 Ol… Seat… WA   
## 2             2 NonResidenti… Hotel            Paramount Ho… 724 Pi… Seat… WA   
## 3             3 NonResidenti… Hotel            5673-The Wes… 1900 5… Seat… WA   
## 4             5 NonResidenti… Hotel            HOTEL MAX     620 ST… Seat… WA   
## 5             8 NonResidenti… Hotel            WARWICK SEAT… 401 LE… Seat… WA   
## 6            10 NonResidenti… Hotel            Camlin        1619 9… Seat… WA   
## # … with 11 more variables: ZipCode <dbl>, CouncilDistrictCode <dbl>,
## #   Neighborhood <chr>, Coordinates <chr>, YearBuilt <dbl>,
## #   NumberofBuildings <dbl>, NumberofFloors <dbl>, PropertyGFATotal <dbl>,
## #   PropertyGFAParking <dbl>, PropertyGFABuilding(s) <dbl>,
## #   ListOfAllPropertyUseTypes <chr>

Here we have explanatory data visualizations. We though that there could be some relationship between NumberofFloors and YearBuilt, our guess was that over time buildings got taller. This explanatory analysis strongly suggests that a linear regression cannot be used to study the relationship between NumberofFloors and YearBuilt.

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = YearBuilt, y = NumberofFloors))

We used a bar plot to visualize how many hotels are in each neighborhood.

ggplot(data = Seattle_hotel) +
  geom_bar(mapping = aes(x = Neighborhood, fill = Neighborhood))

This is a scatter plot which plots PropertyGFATotal against NumberofFloors, and color codes points according to Neighborhood.

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = PropertyGFATotal, y = NumberofFloors, color = Neighborhood)) #+

   #ggtitle("Mammal Body Weight & Total Sleep Amount")

This is a regression which studies the relationship between YearBuilt and PropertyGFABuilding(s).

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = YearBuilt, y = `PropertyGFABuilding(s)`))

cor(Seattle_hotel$YearBuilt,Seattle_hotel$`PropertyGFABuilding(s)`)

## [1] 0.1707993

lm_year_building <- lm(YearBuilt ~ `PropertyGFABuilding(s)`, data = Seattle_hotel)
tidy(lm_year_building)

## # A tibble: 2 × 5
##   term                         estimate std.error statistic   p.value
##   <chr>                           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)              1958.        5.63         348.   1.68e-116
## 2 `PropertyGFABuilding(s)`    0.0000415 0.0000284      1.46 1.49e-  1

This is a regression which studies the relationship between PropertyGFABuilding(s) and PropertyGFATotal.

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFATotal))

cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFATotal)

## [1] 0.9835021

#lm_Hotel_Building_GFA <- lm()


lm_GFA <- lm(`PropertyGFABuilding(s)` ~ PropertyGFATotal, data = Seattle_hotel)
tidy(lm_GFA)

## # A tibble: 2 × 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      2620.    4246.         0.617 5.39e- 1
## 2 PropertyGFATotal    0.835    0.0182    45.8   1.82e-54

This is a regression which studies the relationship between PropertyGFABuilding(s) and PropertyGFAParking.

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFAParking))

cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFAParking)

## [1] 0.5955192

lm_year_building <- lm(`PropertyGFABuilding(s)` ~ PropertyGFAParking, data = Seattle_hotel)
tidy(lm_year_building)

## # A tibble: 2 × 5
##   term               estimate std.error statistic      p.value
##   <chr>                 <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)        79767.   16366.         4.87 0.00000645  
## 2 PropertyGFAParking     2.25     0.360      6.25 0.0000000274

#PropertyGFABuilding(s)

This is a regression which studies the relationship between PropertyGFABuilding(s) and PropertyGFATotal.

ggplot(data = Seattle_hotel) +
  geom_point(mapping = aes(x = `PropertyGFABuilding(s)`, y = PropertyGFATotal))

cor(Seattle_hotel$`PropertyGFABuilding(s)`,Seattle_hotel$PropertyGFATotal)

## [1] 0.9835021

lm_year_building <- lm(`PropertyGFABuilding(s)` ~ PropertyGFATotal, data = Seattle_hotel)
tidy(lm_year_building)

## # A tibble: 2 × 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)      2620.    4246.         0.617 5.39e- 1
## 2 PropertyGFATotal    0.835    0.0182    45.8   1.82e-54

DATA 612 Graduate Project

Max Calzada Munoz, Abigail Tata

12/1/2021